I Tried Creating an Amazon EMR Cluster
Introduction:
Amazon EMR is a managed AWS service that data analysts widely use to run massively distributed workloads in the cloud with open-source projects such as Apache Hadoop, Apache Spark, Apache Hive, Presto, Apache Pig, and a few more.
CREATING AN EMR CLUSTER
- In the AWS Management Console, go to Services ➢ Analytics ➢ EMR.
- Select Create Cluster.
- To create a cluster, you have two choices: Quick Create or Advanced Options (which provides more control and choice). I am picking Quick Create.
- Give your cluster a name, e.g., Developersio-try.
- You can choose to enable logging. If you do, all cluster logs are stored in Amazon S3.
- Choose the launch mode.
You have two modes for launching a cluster:
- Cluster – You create a long-running cluster with a set of applications chosen from the list of apps in the next step.
- Step Execution – With this option, EMR will create a cluster, execute the added steps, and terminate when the steps have completed.
- For the purpose of this exercise, we will choose the Cluster mode of operation.
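The two launch modes can also be expressed on the AWS CLI, which may make the difference easier to see. The following is only a rough sketch: the cluster name, instance settings, and the S3 script path are hypothetical, and the small `run` helper (my own addition) prints each command instead of executing it unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Cluster mode: the cluster keeps running until you terminate it.
launch_long_running() {
  run aws emr create-cluster \
    --name "Developersio-try" \
    --release-label emr-6.7.0 \
    --applications Name=Hadoop Name=Hive \
    --use-default-roles \
    --instance-type m5.xlarge --instance-count 3 \
    --no-auto-terminate
}

# Step execution: run the given Hive script as a step, then shut down.
# $1 is the S3 path of a Hive script (hypothetical, supply your own).
launch_step_execution() {
  script_path=$1
  run aws emr create-cluster \
    --name "Developersio-try-steps" \
    --release-label emr-6.7.0 \
    --applications Name=Hadoop Name=Hive \
    --use-default-roles \
    --instance-type m5.xlarge --instance-count 3 \
    --steps "Type=HIVE,Name=SampleHiveStep,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,$script_path]" \
    --auto-terminate
}
```

Calling `launch_long_running` with the default settings only prints the command, so you can review it before running it for real with `DRY_RUN=0`.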
- Choose a software configuration. Select EMR Release. At the time of this writing, the latest release was emr-6.7.0.
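You can select from a list of application bundles to be configured on the cluster that is being spun up. The component versions depend on the EMR release you select, so the versions listed here may differ from what your console shows: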
- Core Hadoop: Hadoop 2.8.5 with Ganglia 3.7.2, Hive 2.3.6, Hue 4.4.0, Mahout 0.13.0, Pig 0.17.0, and Tez 0.9.2
- HBase: HBase 1.4.10 with Ganglia 3.7.2, Hadoop 2.8.5, Hive 2.3.6, Hue 4.4.0, Phoenix 4.14.3, and ZooKeeper 3.4.14
- Presto: Presto 0.227 with Hadoop 2.8.5 HDFS and Hive 2.3.6 Metastore
- Spark: Spark 2.4.4 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.2
- Trino: Trino 378 with Hadoop 3.2.1 HDFS and Hive 3.1.3 Metastore (new in EMR 6.x)
- For the sake of this exercise, we'll choose Core Hadoop.
You also have the option to use the AWS Glue Data Catalog for table metadata. This lets you use the Glue Data Catalog as an external Hive metastore for these applications.
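Outside the console, the Glue Data Catalog option corresponds to a `hive-site` configuration classification. As a minimal sketch, the script below writes such a configuration file; the file name `glue-catalog.json` is a hypothetical choice, and nothing is sent to AWS.

```shell
#!/bin/sh
# Write an EMR configuration file that points Hive at the AWS Glue Data Catalog.
# The file name is a hypothetical choice for this example.
CONFIG_FILE="${CONFIG_FILE:-glue-catalog.json}"

cat > "$CONFIG_FILE" <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
EOF

# The file can then be passed when creating the cluster, for example:
#   aws emr create-cluster ... --configurations "file://$CONFIG_FILE"
echo "wrote $CONFIG_FILE"
```

Keeping the classification in a separate file makes it easy to reuse the same metastore setting across clusters.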
- Hardware configuration:
You have to choose the hardware configuration for your cluster, which includes the following settings:
- Instance type
- Number of instances (One of the instances will be a master node, and the remaining will act as core nodes.)
- Cluster scaling: scale cluster nodes based on workload
- Auto-termination: terminate the cluster after it has been idle for xx minutes
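For comparison, the hardware settings above could be passed on the AWS CLI roughly as follows. This is a sketch under assumptions: the instance type, instance count, scaling limits, and idle timeout are illustrative values, and the `run` helper (my own addition) only prints the command unless `DRY_RUN=0` is set. Note that the CLI takes the idle timeout in seconds.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 = instance type, $2 = total instance count (one master, the rest core),
# $3 = idle timeout in seconds before the cluster terminates itself.
create_sized_cluster() {
  instance_type=$1
  instance_count=$2
  idle_seconds=$3
  run aws emr create-cluster \
    --name "Developersio-try" \
    --release-label emr-6.7.0 \
    --applications Name=Hadoop \
    --use-default-roles \
    --instance-type "$instance_type" \
    --instance-count "$instance_count" \
    --managed-scaling-policy "ComputeLimits={MinimumCapacityUnits=2,MaximumCapacityUnits=4,UnitType=Instances}" \
    --auto-termination-policy "IdleTimeout=$idle_seconds"
}

# Example: three m5.xlarge instances, terminate after one idle hour.
#   create_sized_cluster m5.xlarge 3 3600
```

The scaling limits here (2 to 4 instances) are arbitrary; pick values that match your workload and budget.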
- Security & Access configuration:
- You can choose an EC2 key pair. If you don't select an EC2 key pair, you won't be able to SSH into your master node.
- You can choose from two levels of permissions:
- Default Permission – This will use default IAM roles. If roles are not present, they will be automatically created for you with managed policies for automatic policy updates.
- EMR role – EMR_DefaultRole
- EC2 instance profile – EMR_EC2_DefaultRole
- Custom Permission – You can select custom roles to tailor permissions for your cluster.
- Select an EMR role.
- Select an EC2 instance profile.
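The two permission levels have rough CLI equivalents. The sketch below assumes a key pair name and custom role names that you would supply yourself (all hypothetical), and the `run` helper prints the commands rather than executing them unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Default permissions: EMR_DefaultRole and EMR_EC2_DefaultRole.
# $1 = name of an existing EC2 key pair (needed to SSH into the master node).
create_with_default_roles() {
  key_name=$1
  run aws emr create-cluster \
    --name "Developersio-try" \
    --release-label emr-6.7.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --ec2-attributes "KeyName=$key_name"
}

# Custom permissions: pass your own EMR role and EC2 instance profile.
# $1 = key pair, $2 = EMR service role, $3 = EC2 instance profile.
create_with_custom_roles() {
  key_name=$1
  service_role=$2
  instance_profile=$3
  run aws emr create-cluster \
    --name "Developersio-try" \
    --release-label emr-6.7.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge --instance-count 3 \
    --service-role "$service_role" \
    --ec2-attributes "KeyName=$key_name,InstanceProfile=$instance_profile"
}
```

If the default roles do not exist in your account yet, `aws emr create-default-roles` can create them before the first call.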
- Select the Create Cluster option. It will take around 5–7 minutes to spin up a cluster.
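Instead of refreshing the console while the cluster starts, you could let the AWS CLI wait for it. This is a small sketch: the cluster ID is whatever your own cluster was assigned (the one in the usage comment is a placeholder), and the `run` helper prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Block until the cluster is running, then print its current state.
# $1 = cluster ID, as shown in the EMR console after creation.
wait_for_cluster() {
  cluster_id=$1
  run aws emr wait cluster-running --cluster-id "$cluster_id"
  run aws emr describe-cluster --cluster-id "$cluster_id" \
    --query Cluster.Status.State --output text
}

# Example (placeholder ID):
#   wait_for_cluster j-XXXXXXXXXXXXX
```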
Deleting the EMR Cluster:
Select the cluster in the EMR console and choose Terminate.
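Termination can also be done from the AWS CLI. A minimal sketch, assuming you know the cluster ID (the one in the usage comment is a placeholder); the `run` helper prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Terminate the cluster and wait until it is fully shut down.
# $1 = cluster ID.
terminate_cluster() {
  cluster_id=$1
  run aws emr terminate-clusters --cluster-ids "$cluster_id"
  run aws emr wait cluster-terminated --cluster-id "$cluster_id"
}

# Example (placeholder ID):
#   terminate_cluster j-XXXXXXXXXXXXX
```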
Conclusion:
This was the first time I tried Amazon EMR, and there are many features within Amazon EMR that I want to try in the future. The same cluster can also be created with other tools; links are shared below.
Create using Terraform:
https://dev.classmethod.jp/articles/create-amazon-emr-cluster-with-terraform/
Create using CLI:
aws emr create-cluster \
  --name "developersIO-cluster" \
  --release-label emr-5.28.0 \
  --applications Name=Hive Name=Spark \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge
Create Using SDK:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/calling-emr-with-java-sdk.html